The data set includes 4898 white wine samples. For each record, the inputs are objective tests (e.g. pH values) and the output is a wine quality score between 0 (very bad) and 10 (excellent), graded by wine experts.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
There are 11 input variables associated with white wine quality, and all of them are numeric. The remaining two columns are the row index X and the integer quality score.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
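The share of wines in the middle of the scale can be checked directly from the counts printed above. The following is a minimal base R sketch that rebuilds a quality vector from those counts; the variable names are illustrative and not taken from the original analysis.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild a quality vector so that table() reproduces the printed counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)
quality_table <- table(quality)

# Share of wines rated 5, 6 or 7.
share_5_to_7 <- sum(quality_table[c("5", "6", "7")]) / length(quality)
print(round(share_5_to_7, 3))
```

With these counts, 4535 of the 4898 wines (roughly 93%) are rated 5, 6 or 7.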
Most of the wines are rated between 5 and 7. Only 5 samples are rated 9, and none of the wines reaches the full score of 10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH of the wines is approximately normally distributed.
The wines have a quite narrow density range; most of them fall between 0.99 and 1.00. When the binwidth is very small (0.00001), it is hard to see the distribution pattern of the density.
With a larger binwidth, the density distribution also looks approximately normal. Because the range of density is very narrow, we divided the density into groups as follows.
From the density group plot, we can easily see that most of the wines have a density between 0.9917 and 0.9967; only a very small number of wines have a density larger than 1.002.
##
## (0.9917,0.9967] (0.9967,1.002] (1.002,1.007] (1.007,1.012]
## 2675 995 6 2
## (1.012,1.017] (1.017,1.022] (1.022,1.027] (1.027,1.032]
## 0 0 0 0
## (1.032,1.037] <NA>
## 0 1220
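We also find that 1220 of 4898 wines have no density group (NA). These do not appear to be missing measurements: the minimum density in the summary above is 0.9871, which lies below the lowest bin edge of 0.9917, so wines at or below that edge fall outside every bin. The density groups should therefore be used with caution in the following analysis unless the bins are extended to cover the full range.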
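How NA groups can arise from binning alone can be sketched with `cut()`. The bin edges below are an assumption: the report does not show the original breaks, and these were chosen only because they reproduce the nine printed labels. The five density values come from the summary and `str()` output above.

```r
# Assumed bin edges that reproduce the nine printed labels
# (the original breaks are not shown in the report).
breaks_orig <- seq(0.9917, 1.0367, by = 0.005)

# Minimum, lowest bin edge, two typical values and the maximum density,
# taken from the summary and str() output above.
density <- c(0.9871, 0.9917, 0.994, 1.001, 1.039)

# Intervals are open on the left, so anything at or below 0.9917
# (or above the last edge) is assigned NA.
group_orig <- cut(density, breaks = breaks_orig)
print(group_orig)

# Hypothetical wider edges that cover the full observed range leave no NA.
breaks_wide <- seq(0.985, 1.040, by = 0.005)
group_wide <- cut(density, breaks = breaks_wide)
print(table(group_wide, useNA = "ifany"))
```

The first quartile of density is 0.9917, so about a quarter of the wines sit at or below the lowest edge, which is consistent with the 1220 NA values in the table.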
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The distribution of residual.sugar looks different from the other variables we explored before. It does not follow a normal distribution in the above plot.
To further explore the residual.sugar variable, we change the x scale to log10. It then falls into two groups (low and high), and each group looks approximately normal.
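The mechanics of the log10 transformation can be sketched in base R. This example uses only the first ten residual.sugar values shown in the `str()` output, so it illustrates the transformation and binning, not the shape of the full distribution; the bin edges are an arbitrary choice for illustration.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

log_sugar <- log10(residual_sugar)

# Histogram counts on the log scale, computed without drawing a plot.
# The bin edges are illustrative, not those of the original figure.
log_hist <- hist(log_sugar, breaks = seq(0, 1.5, by = 0.25), plot = FALSE)
print(data.frame(lower = head(log_hist$breaks, -1),
                 upper = tail(log_hist$breaks, -1),
                 count = log_hist$counts))
```

Even in this tiny sample, the values near 1.5 and the values of 7 and above land in separate bins on the log scale, whereas on the raw scale the large values stretch the axis and compress everything else.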
The above analysis also shows outliers in residual.sugar (a few wine samples have very large residual.sugar compared with the others). In this case, a boxplot is a good way to depict the outliers of the variable.
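A boxplot conventionally flags points beyond 1.5 times the interquartile range from the box. As a worked check, the sketch below applies that rule to the residual.sugar quartiles printed in the summary above. It assumes the summary quartiles are close to the hinges a boxplot would use, which can differ slightly.

```r
# Quartiles and maximum of residual.sugar, copied from the summary above.
q1 <- 1.7
q3 <- 9.9
max_sugar <- 65.8

# Conventional boxplot fences: 1.5 * IQR beyond the quartiles.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
print(max_sugar > upper_fence)  # the maximum lies beyond the upper fence
```

Under this rule, any wine with residual.sugar above about 22.2 would be drawn as an outlier, and the lower fence is negative, so no low outliers are possible.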
To relate free.sulfur.dioxide and total.sulfur.dioxide, we create another variable, “ratio.sulfur.dioxide”, defined as free.sulfur.dioxide divided by total.sulfur.dioxide.
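The derived variable can be sketched as follows, using the first five rows shown in the `str()` output. The data frame name `wine_head` is hypothetical; only the column names come from the report.

```r
# First five rows of the two sulfur dioxide columns, from the str() output.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Dimensionless ratio of free to total sulfur dioxide.
wine_head$ratio.sulfur.dioxide <-
  wine_head$free.sulfur.dioxide / wine_head$total.sulfur.dioxide

print(round(wine_head$ratio.sulfur.dioxide, 3))
```

Rounded to three decimals, these values agree with the ratio.sulfur.dioxide column that appears in the later `str()` output (0.265, 0.106, 0.309, 0.253, 0.253).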
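Alcohol does not look normally distributed, so we divide it into groups as follows. The alcohol.group variable has only 2 NA records. Since the minimum alcohol value is exactly 8.00 and the lowest interval (8,10] is open on the left, these are most likely wines at exactly 8.00 rather than missing measurements.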
## (8,10] (10,12] (12,14] (14,16] NA's
## 2083 2102 709 2 2
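The alcohol grouping can be sketched with `cut()`. The breaks below are inferred from the printed interval labels and are an assumption, since the original call is not shown. The ten alcohol values are the first ten from the `str()` output.

```r
# First ten alcohol values, copied from the str() output above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Breaks inferred from the labels "(8,10]", "(10,12]", ... (assumption).
breaks <- c(8, 10, 12, 14, 16)
alcohol_group <- cut(alcohol, breaks = breaks)
print(table(alcohol_group))

# A value exactly on the lowest edge is excluded by the left-open interval...
print(cut(8, breaks = breaks))

# ...unless include.lowest = TRUE closes the first interval.
print(cut(8, breaks = breaks, include.lowest = TRUE))
```

The minimum alcohol value in the summary is exactly 8.00, so this left-open edge is a plausible source of the 2 NA records; `include.lowest = TRUE` would assign such wines to the first group.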
The data set includes 4898 white wine samples. For each record, the inputs are 11 objective tests (e.g. pH values) and the output is a wine quality score between 0 (very bad) and 10 (excellent), graded by wine experts.
The wine quality is the main feature of interest. From the dataset, we want to explore the relation between different objective tests and the associated quality score.
We explore the distributions of different variables, e.g. pH, density, alcohol, residual sugar, free sulfur dioxide and total sulfur dioxide. These are features of the wine, and we are trying to determine which of them contribute to a higher quality score.
I created a new variable, ratio.sulfur.dioxide, defined as free.sulfur.dioxide divided by total.sulfur.dioxide. This dimensionless variable helps us understand the relation between free.sulfur.dioxide and total.sulfur.dioxide, and it makes the sulfur dioxide content comparable between different wines.
The distribution of residual.sugar looks different from the other variables we explored before. It did not follow a normal distribution when I first plotted it. When I changed the x scale to log10, I found two different groups, each of which looks approximately normal.
## 'data.frame': 4898 obs. of 17 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ density.group : Factor w/ 9 levels "(0.9917,0.9967]",..: 2 1 1 1 1 1 1 2 1 1 ...
## $ ratio.sulfur.dioxide: num 0.265 0.106 0.309 0.253 0.253 ...
## $ alcohol.group : Factor w/ 4 levels "(8,10]","(10,12]",..: 1 1 2 1 1 2 1 1 1 2 ...
## $ quality_group : Factor w/ 2 levels "(2,6]","(6,9]": 1 1 1 1 1 1 1 1 1 1 ...
Let’s first take a look at all the test variables, including the ratio.sulfur.dioxide and alcohol.group variables we defined in the previous section. The variables we want to explore further are: fixed.acidity, residual.sugar, pH, alcohol, quality, density.group, ratio.sulfur.dioxide and alcohol.group.
We’ll take a closer look at the plots above in the following sections.
From the fixed.acidity vs. quality plot, we can tell that the higher quality wines have a slightly lower fixed.acidity. However, plotting a single variable against quality does not seem to be an appropriate choice for the bivariate plots: from the following plots, we cannot really summarize the relation between any one variable alone and the quality score.
Let’s now look at the relation between two variables. The first plot is between residual.sugar and alcohol. Most of the points are on the left of the plot because only a few samples have a large residual.sugar value. To better display the distribution, we limit residual.sugar to its 99% quantile in the following plot.
We can see there is a relation between alcohol and residual.sugar when residual.sugar is larger than 2. A larger residual.sugar value indicates relatively lower alcohol in the wines.
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
There is an almost linear relation between density and alcohol; the sample estimate of the correlation is -0.78.
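The t statistic and confidence interval in the test output follow from the reported correlation and degrees of freedom alone. The sketch below recomputes them with the standard formulas for Pearson's test (t from r and df, and a Fisher z interval); it takes the printed numbers as input rather than the data.

```r
# Values copied from the cor.test output above.
r  <- -0.7801376
df <- 4896
n  <- df + 2          # Pearson's test uses df = n - 2

# t statistic for the null hypothesis of zero correlation.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

Note that df = 4896 implies all 4898 samples entered the test, i.e. neither density nor alcohol has missing values in the raw columns.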
From the fixed.acidity and quality plot, we can tell that the higher quality wines have a slightly lower fixed.acidity.
I observe a strong correlation between density and alcohol (almost linear). Higher density indicates lower alcohol.
It is between density and alcohol, as described above.
From the previous section, we see that density and alcohol are closely correlated. We add a third variable, quality, to the plot. The picture is not clear with so many colors, but it seems that the higher quality samples (purple and pink) are in the upper-left corner, which indicates lower density and higher alcohol.
To reduce the number of colors in the plot, we create quality_group. We can see clearly on this plot that the wines with higher quality scores (green dots) are those with lower density and higher alcohol compared with the low quality group.
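The two-level grouping can be sketched with `cut()`. The breaks are an assumption inferred from the factor levels "(2,6]" and "(6,9]" in the `str()` output, and the per-score counts come from the quality table earlier in the report.

```r
# Quality scores present in the data and their counts, from the table above.
quality <- 3:9
counts  <- c(20, 163, 1457, 2198, 880, 175, 5)

# Breaks inferred from the printed factor levels (assumption).
quality_group <- cut(quality, breaks = c(2, 6, 9))
print(levels(quality_group))

# Number of wines falling in each group.
group_sizes <- tapply(counts, quality_group, sum)
print(group_sizes)
```

With these breaks, scores 3 to 6 form the low group (3838 wines) and scores 7 to 9 form the high group (1060 wines), so the high quality group is the much smaller of the two.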
The same quality grouping is applied to the alcohol vs. residual.sugar plot. There are no obvious findings here.
## # A tibble: 6 × 3
## quality alcohol.group n
## <int> <fctr> <int>
## 1 3 (8,10] 7
## 2 3 (10,12] 10
## 3 3 (12,14] 2
## 4 4 (8,10] 81
## 5 4 (10,12] 74
## 6 4 (12,14] 8
For each alcohol.group, we compare the distribution of the quality score. They all look approximately normal. However, the center of the distribution for the high alcohol group tends to shift right in the above plot.
## # A tibble: 6 × 4
## quality alcohol.group n freq
## <int> <fctr> <int> <dbl>
## 1 3 (8,10] 7 0.36842105
## 2 3 (10,12] 10 0.52631579
## 3 3 (12,14] 2 0.10526316
## 4 4 (8,10] 81 0.49693252
## 5 4 (10,12] 74 0.45398773
## 6 4 (12,14] 8 0.04907975
From the frequency bar plot, we can clearly see that wines in the higher alcohol groups take up an increasing proportion at higher quality scores.
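The freq column in the table above is the count of each alcohol group divided by the total count within the same quality score. The following base R sketch uses the six printed rows; the original analysis printed a tibble and may have used a different approach, so this is only one way to obtain the same numbers.

```r
# The six rows printed above (quality, alcohol.group, n).
counts <- data.frame(
  quality = c(3, 3, 3, 4, 4, 4),
  alcohol.group = c("(8,10]", "(10,12]", "(12,14]",
                    "(8,10]", "(10,12]", "(12,14]"),
  n = c(7, 10, 2, 81, 74, 8)
)

# Within-quality proportion: each n divided by the sum of n for that quality.
counts$freq <- counts$n / ave(counts$n, counts$quality, FUN = sum)
print(counts)
```

Because the proportions are normalized within each quality score, they can be compared across scores even though the number of wines per score differs by two orders of magnitude.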
## # A tibble: 6 × 4
## quality density.group n freq
## <int> <fctr> <int> <dbl>
## 1 3 (0.9917,0.9967] 9 0.5625000
## 2 3 (0.9967,1.002] 7 0.4375000
## 3 4 (0.9917,0.9967] 108 0.7883212
## 4 4 (0.9967,1.002] 29 0.2116788
## 5 5 (0.9917,0.9967] 920 0.6789668
## 6 5 (0.9967,1.002] 432 0.3188192
Wines with higher quality scores are, in general, those with lower density and higher alcohol compared with low quality wines.
Wines in the higher alcohol groups take up an increasing proportion at higher quality scores.
The distribution of residual sugar in the wines is bimodal on the log scale. Within the low residual sugar group, the values look approximately normally distributed.
We plot the relation between residual sugar and alcohol for all the wine samples. These two variables appear closely related, in an almost linear fashion. We then assign different colors to wines of different quality. The wines are divided into 2 quality groups: quality scores between 3 and 6 (low quality) and scores between 7 and 9 (high quality). An interesting finding is that high quality wines are more likely to be high in alcohol and low in residual sugar.
In this plot, we explore the relation between alcohol and the wine quality scores. We divided alcohol into 4 groups (8-10%, 10-12%, 12-14% and 14-16%). The 14-16% group appears only as a very tiny portion at wine quality 7, so we ignore it and focus on the other 3 alcohol groups. For the 10-12% alcohol group, the proportion remains similar across all wine qualities. However, the higher alcohol group (12-14%) takes up an increasing portion of the wine samples as wine quality increases.
The white wine dataset includes 4898 samples. Each sample has 11 variables describing the wine, such as pH, density and alcohol, together with a quality score between 0 and 10.
In the first section, I plotted the frequency distribution of each variable to see whether any distribution was abnormal. Most of the variable distributions appeared more or less normal. However, the first plot for residual sugar was not clear, so I changed the x scale to log10 and found that the distribution of residual sugar in the wines is bimodal on the log scale.
Then I explored the relation between two variables. This was where I ran into difficulties. I couldn’t find any interesting relations when I plotted each variable against the quality score in one figure, mainly because the quality score is a discrete integer between 0 and 10. I decided to explore the relation between the variables and the quality scores in the multi-variable plot section instead of the bi-variable plots. However, I did find an interesting, almost linear relation between residual sugar and alcohol. When I added a third variable (quality group) to the plot, I found that high quality wines are more likely to be high in alcohol and low in residual sugar. For the multi-variable plot section, I started with some interesting plots from the previous section and added a third variable to them.
There is definitely a lot more that can be done on this dataset. In the first section, I found that 1220 of 4898 wines have no density group. The correlation test between density and alcohol reports 4896 degrees of freedom, so all 4898 samples have a density value; the NA groups most likely come from wines below the lowest bin edge rather than from missing data. Re-binning density from its minimum would allow us to explore the influence of density on the quality score further. There are so many variables in the dataset that I didn’t come up with a correlation between the individual variables and the quality scores. It would be great to work further toward a quality score prediction model based on the variables given.